MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

and 


CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING 
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES 


A.I. Memo No. 1571 March, 1996 

C.B.C.L. Memo No. 136 


Computing upper and lower bounds on 
likelihoods in intractable networks 


Tommi S. Jaakkola* and Michael I. Jordan* 

{tommi,jordan}@psyche.mit.edu 

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. 


Abstract 

We present techniques for computing upper and lower bounds on the likelihoods of partial instantiations 
We illustrate the tightness of the obtained bounds by numerical experiments. 

1 Introduction 

A graphical model provides an explicit representation of qualitative
dependencies among the variables associated with the nodes of the graph
(Pearl, 1988). Assigning values (potentials or probability tables) to
the links connecting the variables in these models enables numerical
(or quantitative) computation of beliefs about the values of the
variables on the basis of acquired evidence. The computations involved,
i.e., propagation of beliefs, can be handled by now-standard exact
methods (Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990).
Junction trees serve as representational platforms for these exact
probabilistic calculations and are constructed from directed graphical
representations via moralization and triangulation. Although powerful
in utilizing the structure of the underlying networks, junction trees
may, in some cases, contain cliques that are prohibitively large. We
focus in this paper on methods for dealing with such large
(sub)structures.

Large clique sizes lead not only to long execution times but also
involve exponentially many parameters that must be assessed or learned.
The latter issue is generally addressed via parsimonious
representations such as the logistic sigmoid (Neal, 1992) or the
noisy-OR function (Pearl, 1988). We consider both of these
representations in the current paper. We stay within a directed
framework and thereby retain the compactness of these representations
throughout our inference and estimation algorithms.

As an alternative to sampling methods in intractable networks we
develop principled approximations by computing upper and lower bounds
on likelihoods of partial instantiations of variables. Such bounds can
be combined to give rise to confidence intervals for the desired
likelihoods (e.g. for node marginals). Although the problem of finding
confidence intervals to a prescribed accuracy is NP-hard (Dagum and
Luby, 1993), bounds that can be computed efficiently may nevertheless
yield confidence intervals that are accurate enough to be useful in
practice.

Saul et al. (1996) derived a rigorous lower bound for sigmoid belief
networks and we complete the picture here by developing the missing
upper bounds for sigmoid networks. We also develop both upper and lower
bounds for noisy-OR networks. While the lower bounds we obtain are
applicable to generic network structures, the upper bounds are
currently restricted to two-level networks. Although this is a serious
restriction, there are nonetheless many potential applications for such
upper bounds, including the probabilistic reformulation of the QMR
knowledge base (Shwe et al., 1991). We emphasize finally that our focus
in this paper is on techniques of bounding rather than on
all-encompassing inference algorithms; tailoring the bounds for
specific problems or merging them with exact methods may yield a
considerable advantage.

The paper is structured as follows. Section 2 introduces sigmoid belief
networks, develops the techniques for upper and lower bounds, and gives
preliminary numerical analysis of the accuracy of the bounds. Section 3
is devoted to the analogous results for noisy-OR networks. In section 4
we summarize the results and describe some future work.


2 Sigmoid belief networks 

Sigmoid belief networks are (directed) probabilistic networks defined
over binary variables $S_1, \ldots, S_n$. The joint distribution for
the variables has the usual decompositional structure:

$$P(S_1, \ldots, S_n \mid \theta) = \prod_i P(S_i \mid \mathrm{pa}[i], \theta) \tag{1}$$


The conditional probabilities, however, take a particular form given by

$$P(S_i \mid \mathrm{pa}[i], \theta) = g\Big( (2S_i - 1) \sum_{j \in \mathrm{pa}[i]} \theta_{ij} S_j \Big) \tag{2, 3}$$


where $g(x) = 1/(1 + \exp(-x))$ is the logistic function (also called a
“sigmoid” function based on its graphical shape; see Figure 3). The
parameters specifying these conditional probabilities are the
real-valued “weights” $\theta_{ij}$. We note that the choice of this
dependency model is not arbitrary but is rooted in logistic regression
in statistics (McCullagh & Nelder, 1983). Furthermore, this form of
dependency corresponds to the assumption that the odds from each parent
of a node combine multiplicatively; the weights $\theta_{ij}$ in this
interpretation bear a relation to log-odds.
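
The multiplicative combination of odds can be checked numerically. The sketch below assumes the conditional probability has the form $g((2S_i - 1)\sum_j \theta_{ij} S_j)$, consistent with the terms that appear in the bounds of section 2.1; the function names, weights, and parent states are hypothetical and chosen only for illustration.

```python
import math


def g(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_conditional(s_i, parent_states, weights):
    """P(S_i = s_i | pa[i]) = g((2 s_i - 1) * sum_j theta_ij * S_j)."""
    activation = sum(w * s for w, s in zip(weights, parent_states))
    return g((2 * s_i - 1) * activation)


# Hypothetical weights theta_ij and parent states S_j.
theta = [1.5, -0.7, 0.3]
parents = [1, 0, 1]

p_on = sigmoid_conditional(1, parents, theta)
p_off = sigmoid_conditional(0, parents, theta)

# The odds equal exp(1.5) * exp(0.3): each active parent contributes a factor.
odds = p_on / p_off
```

Only the parents that are switched on contribute a factor $e^{\theta_{ij}}$ to the odds, which is the sense in which the weights relate to log-odds.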

In the remainder of this section we present techniques 
for computing upper and lower bounds on the likelihood 
of any instantiation of variables in sigmoid networks. We 
note that the upper bounds are restricted to two-level 
(bipartite) networks while the lower bounds are valid for 
arbitrary network structures. 


2.1 Upper bound for sigmoid network 

We restrict our attention to two-level directed architectures. The
joint probability for this class of models can be written as

$$P(S_1, \ldots, S_n \mid \theta) = \prod_{i \in L_1} g\Big( (2S_i - 1) \sum_{j \in L_2} \theta_{ij} S_j \Big) \times \prod_{j \in L_2} P(S_j \mid \theta_j) \tag{4}$$

where $L_1$ and $L_2$ signify the two layers of a bipartite graph with
connections from $L_2$ to $L_1$.

To compute the likelihood of an instantiation of variables in these
networks, we note that (i) any instantiated variables in layer $L_2$
only reduce the complexity of the calculations, and (ii) the form of
the architecture makes any uninstantiated variables in $L_1$, or the
“receiving” layer, inconsequential. We will thus adopt a simplifying
notation in which the evidence consists of all and only the variables
in $L_1$. Thus, the goal is to compute

$$P(\{S_i\}_{i \in L_1} \mid \theta) = \sum_{\{S_j\}_{j \in L_2}} P(S_1, \ldots, S_n \mid \theta) \tag{5}$$

Given our assumption that computing the likelihood is intractable, we
seek an upper bound instead. Let us briefly outline our strategy. The
goal is to simplify the joint distribution such that the
marginalization across $L_2$ can be accomplished efficiently, while
maintaining at all times a rigorous upper bound on the likelihood. Our
approach is to introduce additional parameters into the problem (known
as “variational parameters”) such that the resulting joint probability
distribution factorizes over the uninstantiated variables. Thus we
first find a “variational” form for the joint distribution. Although
the variational forms are exact, they can be turned into upper bounds
by not carrying out the minimizations involved and instead fixing the
variational parameters. As we will see below, this type of variational
bound can be obtained by combining variational representations for each
sigmoid function in our probability model. We note finally that the
variational parameters that are kept fixed during the likelihood
calculation can be employed afterwards to optimize the likelihood
bound. In essence, this amounts to exchanging the order of the
summation over the uninstantiated variables and the variational
minimization.

To derive the upper bound we first make use of the following
variational transformation of the sigmoid function (see appendix A):

$$g(x) = \frac{1}{1 + e^{-x}} = \min_{\xi \in [0,1]} e^{\xi x - H(\xi)} \tag{6}$$

where $H(\cdot)$ is the binary entropy function. Inserting this
transformation into the probability model we find


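$$P(S_1, \ldots, S_n \mid \theta) = \prod_{i \in L_1} \Big[ \min_{\xi_i} e^{\xi_i (2S_i - 1) \sum_{j \in L_2} \theta_{ij} S_j - H(\xi_i)} \Big] \times \prod_{j \in L_2} P(S_j \mid \theta_j) \tag{7}$$

$$= \min_{\xi} \Big\{ e^{-\sum_{i \in L_1} H(\xi_i)} \prod_{j \in L_2} \Big[ e^{\sum_{i \in L_1} \xi_i (2S_i - 1) \theta_{ij} S_j} P(S_j \mid \theta_j) \Big] \Big\} \stackrel{\mathrm{def}}{=} \min_{\xi} P(S_1, \ldots, S_n \mid \theta, \xi) \tag{8}$$
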
where we have pulled the minimizations outside and combined the terms
that depend on each of the uninstantiated variables $S_j$ in $L_2$.
This reorganization shows that $P(S_1, \ldots, S_n \mid \theta, \xi)$
(defined implicitly) factorizes over $\{S_j\}_{j \in L_2}$. A simple
upper bound on the likelihood is thus obtained in closed form by
exchanging the order of the summation and the minimization:


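$$P(\{S_i\}_{i \in L_1} \mid \theta) = \sum_{\{S_j\}_{j \in L_2}} P(S_1, \ldots, S_n \mid \theta) \tag{9}$$

$$= \sum_{\{S_j\}_{j \in L_2}} \min_{\xi} \{ P(S_1, \ldots, S_n \mid \theta, \xi) \} \tag{10}$$

$$\le \min_{\xi} \sum_{\{S_j\}_{j \in L_2}} P(S_1, \ldots, S_n \mid \theta, \xi) \tag{11}$$

$$= \min_{\xi} \Big\{ e^{-\sum_{i \in L_1} H(\xi_i)} \prod_{j \in L_2} \Big[ P(S_j = 1 \mid \theta_j)\, e^{\sum_{i \in L_1} \xi_i (2S_i - 1) \theta_{ij}} + P(S_j = 0 \mid \theta_j) \Big] \Big\} \tag{12}$$
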

We state here a few facts about the bound (mostly without proof): (i)
the bound can never be greater than one since one is always achieved by
setting all $\xi$ to zero, (ii) the bound becomes exact in the limit of
small parameter values, and (iii) for fixed prior probabilities
$P(S_j \mid \theta_j)$ the bound has a lower limit and therefore cannot
follow closely the true likelihood for very improbable events.
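
As an illustration of the closed-form bound, the following sketch compares it with the exact likelihood obtained by brute-force enumeration in a tiny two-level network. It assumes the sigmoid conditional $g((2S_i - 1)\sum_j \theta_{ij} S_j)$; the weights, priors, evidence, and variational parameters are hypothetical values, and the variational parameters are fixed by hand rather than optimized.

```python
import itertools
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_entropy(xi):
    """H(xi) in nats, with H(0) = H(1) = 0."""
    if xi <= 0.0 or xi >= 1.0:
        return 0.0
    return -xi * math.log(xi) - (1.0 - xi) * math.log(1.0 - xi)


def exact_likelihood(s1, theta, prior):
    """Sum the two-level joint over all configurations of layer L2."""
    total = 0.0
    for s2 in itertools.product([0, 1], repeat=len(prior)):
        p = 1.0
        for j, sj in enumerate(s2):
            p *= prior[j] if sj else 1.0 - prior[j]
        for i, si in enumerate(s1):
            act = sum(theta[i][j] * s2[j] for j in range(len(s2)))
            p *= g((2 * si - 1) * act)
        total += p
    return total


def upper_bound(s1, theta, prior, xi):
    """Closed-form upper bound for fixed variational parameters xi."""
    value = math.exp(-sum(binary_entropy(x) for x in xi))
    for j in range(len(prior)):
        expo = sum(xi[i] * (2 * s1[i] - 1) * theta[i][j]
                   for i in range(len(s1)))
        value *= prior[j] * math.exp(expo) + (1.0 - prior[j])
    return value


# Hypothetical network: theta[i][j] with i in L1 (evidence), j in L2 (hidden).
theta = [[0.8, -0.5, 0.3], [-0.2, 0.6, 0.4]]
prior = [0.5, 0.5, 0.5]      # P(S_j = 1 | theta_j)
s1 = [1, 0]                  # instantiated layer L1

exact = exact_likelihood(s1, theta, prior)
bound_at_zero = upper_bound(s1, theta, prior, [0.0, 0.0])  # equals one
bound_fixed = upper_bound(s1, theta, prior, [0.3, 0.4])    # some fixed xi
```

Any choice of $\xi \in [0,1]$ yields a valid bound; minimizing over $\xi$ only tightens it.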

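To simplify the minimization with respect to $\xi$ we can work on a log
scale and make use of the following Legendre transformation:

$$\log x = \min_{\lambda} \{ \lambda x - \log \lambda - 1 \} \tag{13}$$

As a result we get

$$\log P(\{S_i\}_{i \in L_1} \mid \theta) \le -\sum_{i \in L_1} H(\xi_i) + \sum_{j \in L_2} \lambda_j \Big[ P(S_j = 1 \mid \theta_j)\, e^{\sum_{i \in L_1} \xi_i (2S_i - 1) \theta_{ij}} + P(S_j = 0 \mid \theta_j) \Big] + \sum_{j \in L_2} [ -\log \lambda_j - 1 ] \tag{14}$$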

where we have ceased to indicate explicitly that the bound will be
minimized over the adjustable parameters. This new form of the bound
has the advantage that the minimization with respect to each parameter
($\xi$ or $\lambda$) is reduced to convex optimization$^1$ and can be
done by any standard method (e.g. Newton-Raphson). Note that the
accuracy of the bound is not compromised by the additional Legendre
transformation. Its effect is merely to simplify the expressions for
optimization.
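
The Legendre transformation of the logarithm is easy to verify numerically. In this minimal sketch the argument `x` and the trial values of `lam` are arbitrary illustrative choices; the minimizer $\lambda^* = 1/x$ follows from setting the derivative of $\lambda x - \log \lambda - 1$ to zero.

```python
import math


def legendre_upper(x, lam):
    """lam * x - log(lam) - 1, an upper bound on log(x) for any lam > 0."""
    return lam * x - math.log(lam) - 1.0


x = 0.37                                   # hypothetical positive argument
tight = legendre_upper(x, 1.0 / x)         # lambda* = 1/x recovers log(x)
loose = [legendre_upper(x, lam) for lam in (0.5, 1.0, 2.0, 5.0)]
```

Since the bound is tight at $\lambda^*$, introducing $\lambda_j$ does not loosen the likelihood bound once $\lambda_j$ is optimized.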

2.2 Generic lower bound for sigmoid network 

Methods for finding approximate lower bounds on likelihoods were first
presented by Dayan et al. (1995) and Hinton et al. (1995) in the
context of a layered network known as the “Helmholtz machine.” Saul et
al. (1996) subsequently showed how such bounds could be made rigorous
(by appeal to mean field theory) in the case of generic sigmoid
networks. Unlike the method for obtaining upper bounds presented in the
previous section, the lower bound methodology poses no constraints on
the network structure. We briefly introduce the idea here (for details
see Saul et al.).

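Let us denote the set of instantiated variables by
$\{S_i\}_{i \in L}$. A lower bound on the (log) likelihood can be found
directly via Jensen’s inequality:

$$\log P(\{S_i\}_{i \in L} \mid \theta) = \log \sum_{\{S'\}} P(S_1, \ldots, S_n \mid \theta) = \log \sum_{\{S'\}} Q(\{S'\}) \frac{P(S_1, \ldots, S_n \mid \theta)}{Q(\{S'\})} \ge \sum_{\{S'\}} Q(\{S'\}) \log \frac{P(S_1, \ldots, S_n \mid \theta)}{Q(\{S'\})} \tag{15}$$

1 The convexity with respect to each $\xi$ follows from the convexity
of $e^x$ and the positivity of the multiplying coefficients $\lambda$.
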
which holds for any distribution $Q$ over the uninstantiated variables
$\{S'\}$. The bound becomes exact if $Q(\{S'\})$ can represent the true
posterior distribution $P(\{S'\} \mid \{S_i\}_{i \in L}, \theta)$. For
other choices of $Q$ the accuracy of the bound is characterized by the
Kullback-Leibler distance between $Q$ and the posterior. As we are
assuming that computing the likelihood exactly is intractable, the idea
is to find a distribution $Q$ that can be computed efficiently. The
simplest of such distributions is the completely factorized (“mean
field”) distribution:

$$Q(\{S'\}) = \prod_i \mu_i^{S_i} (1 - \mu_i)^{1 - S_i} \tag{16}$$

Inserting this distribution into the lower bound (eq. (15)) we can, in
principle, carry out the summation$^2$ and get an expression for the
lower bound. Consequently, the adjustable parameters $\mu_i$ can be
modified to make the bound tighter.

For later utility we rewrite the lower bound in eq. (15) as

$$\log P(\{S_i\}_{i \in L} \mid \theta) \ge E_Q\{ \log P(S_1, \ldots, S_n \mid \theta) \} + H_Q \tag{17}$$

$$= \sum_i E_Q\{ \log P(S_i \mid \mathrm{pa}[i], \theta) \} + H_Q \tag{18}$$

where $H_Q$ is the entropy of the $Q$ distribution and $E_Q\{\cdot\}$
is the expectation with respect to $Q$. We note finally that developing
the bound further is highly dependent on the type of the network -
whether sigmoid, noisy-OR, or other$^3$.
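
The Jensen bound can be illustrated on a network small enough to enumerate. The sketch below is not the efficient mean field computation of Saul et al.; it evaluates $E_Q\{\log P\} + H_Q$ by brute force for a factorized $Q$ in a hypothetical two-level sigmoid network (assumed conditional $g((2S_i - 1)\sum_j \theta_{ij} S_j)$), with all parameter values chosen for illustration.

```python
import itertools
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint(s1, s2, theta, prior):
    """Two-level sigmoid joint P(S_1, ..., S_n | theta)."""
    p = 1.0
    for j, sj in enumerate(s2):
        p *= prior[j] if sj else 1.0 - prior[j]
    for i, si in enumerate(s1):
        act = sum(theta[i][j] * s2[j] for j in range(len(s2)))
        p *= g((2 * si - 1) * act)
    return p


def log_likelihood(s1, theta, prior):
    """Exact log-likelihood of the evidence s1 by enumeration over L2."""
    configs = itertools.product([0, 1], repeat=len(prior))
    return math.log(sum(joint(s1, s2, theta, prior) for s2 in configs))


def mean_field_bound(s1, theta, prior, mu):
    """E_Q{log P} + H_Q for a factorized Q with means mu in (0, 1)."""
    bound = 0.0
    for s2 in itertools.product([0, 1], repeat=len(mu)):
        q = 1.0
        for j, sj in enumerate(s2):
            q *= mu[j] if sj else 1.0 - mu[j]
        bound += q * (math.log(joint(s1, s2, theta, prior)) - math.log(q))
    return bound


# Hypothetical parameters.
theta = [[0.8, -0.5, 0.3], [-0.2, 0.6, 0.4]]
prior = [0.5, 0.5, 0.5]
s1 = [1, 0]

exact = log_likelihood(s1, theta, prior)
lower = mean_field_bound(s1, theta, prior, [0.5, 0.4, 0.6])
```

With a single uninstantiated variable the posterior is itself factorized, so setting $\mu$ to the posterior probability makes the bound exact.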


2.3 Numerical experiments for sigmoid 
network 

In testing the accuracy of the developed bounds we used $8 \to 8$
networks (complete bipartite graphs), where the network size was chosen
to be small enough to allow exact computation of the true likelihood
for purposes of comparison. The method of testing was as follows. The
parameters for the $8 \to 8$ networks were drawn from a Gaussian prior
distribution and a sample from the resulting joint distribution of the
variables was generated. The variables in the “receiving” layer of the
bipartite graph were instantiated according to the sample. The true
likelihood as well as the upper and lower bounds were computed for the
instantiation. The resulting bounds were assessed by employing the
relative error in log-likelihood, i.e.
$(\log P_{\mathrm{Bound}} / \log P - 1)$, as a measure of accuracy.

2 The summation even in case of simple factorized distributions can be
non-trivial to perform; see Saul et al.

3 For a derivation of lower bounds for networks with cumulants
replacing the sigmoid function see Jaakkola et al. (1996).


More precisely, the prior distribution over the parameters was taken to
be

$$P(\theta) = \prod_i \prod_{j \in \mathrm{pa}[i]} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\theta_{ij}^2 / (2\sigma^2)} \tag{19}$$


where the overall variance $\sigma^2$ allows us to vary the degree to
which the resulting parameters make the two layers of the network
dependent. For small values of $\sigma^2$ the layers are almost
independent whereas larger values make them strongly interdependent. To
make the situation worse for the bounds$^4$ we enhanced the coupling of
the layers by setting $P(S_j \mid \theta_j) = 1/2$ for all the
uninstantiated variables, i.e., making them maximally variable.

In order to make the accuracy of the bounds commensurate with those for
the noisy-OR networks reported below, we summarize the results via a
measure of inter-layer dependence. This dependence was measured by

$$\sigma_{\mathrm{std}} = \sqrt{ \mathrm{Var}\{ P(S_i \mid \mathrm{pa}[i]) \} } \tag{20}$$

that is, the variability of the likelihood of the instantiation due to
different configurations of the uninstantiated variables. Figure 1
illustrates the accuracy of the bounds as measured by the relative
log-likelihood as a function of $\sigma_{\mathrm{std}}$$^5$. In terms
of probabilities, a relative error of $\epsilon$ translates into a
$P^{1+\epsilon}$ approximation of the true likelihood $P$. Note that
the relative error is always positive for the upper bound and negative
for the lower bound. The figure indicates that the bounds are accurate
enough to be useful. In addition, we see that the upper bound
deteriorates faster with increasingly coupled layers.


[Plot: horizontal axis "Stdv" from 0.05 to 0.25; vertical axis from -0.1 to 0.1.]

Figure 1: Accuracy of the bounds for sigmoid networks. The solid lines
are the median relative errors in log-likelihood as a function of
$\sigma_{\mathrm{std}}$. The upper and lower curves correspond to the
upper and lower bounds respectively.



4 Both the upper and lower bounds are exact in the limit of lightly
coupled layers.

5 Note that the maximum value for $\sigma_{\mathrm{std}}$ is 1/2.
3 Noisy-OR networks 

Noisy-OR networks, like sigmoid networks, can be represented by DAGs
and are written as a product form for the joint distribution:

$$P(S_1, \ldots, S_n \mid \theta) = \prod_i P(S_i \mid \mathrm{pa}[i], \theta) \tag{21}$$

Unlike sigmoid networks, however, the conditional probabilities for a
noisy-OR network are defined as

$$P(S_i \mid \mathrm{pa}[i], \theta) = \Big( 1 - \prod_{j \in \mathrm{pa}[i]} (1 - q_{ij})^{S_j} \Big)^{S_i} \Big( \prod_{j \in \mathrm{pa}[i]} (1 - q_{ij})^{S_j} \Big)^{1 - S_i} \tag{22}$$

where, for example, the parameter $q_{ij}$ corresponds to the
probability that the $j$th parent of $i$ alone can turn $S_i$ on. A
constant “leak” or “bias” can be included by introducing a dummy
(parent) variable whose value is always fixed to one.

In the following two sections we develop methods for computing upper
and lower bounds on the likelihood of any instantiation of variables in
the noisy-OR network. Similarly to the case of sigmoid networks the
upper bound is applicable to a restricted class of networks while the
lower bound remains generic. For clarity of the forthcoming derivations
we introduce the notation:

$$P(S_i = 0 \mid \mathrm{pa}[i], \theta) = \prod_{j \in \mathrm{pa}[i]} (1 - q_{ij})^{S_j} = e^{-\sum_{j \in \mathrm{pa}[i]} \theta_{ij} S_j} \tag{23}$$

with $\theta_{ij} = -\log(1 - q_{ij}) \ge 0$.
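
The equivalence of the product form and the exponential form can be confirmed with a few lines of code. This is a minimal sketch; the function names and the parameter values are hypothetical.

```python
import math


def noisy_or_off_prob(parent_states, q):
    """P(S_i = 0 | pa[i]) = prod_j (1 - q_ij)^{S_j}."""
    p = 1.0
    for s, q_ij in zip(parent_states, q):
        if s:
            p *= 1.0 - q_ij
    return p


def noisy_or_off_prob_exp(parent_states, q):
    """Same probability written as exp(-sum_j theta_ij S_j)."""
    theta = [-math.log(1.0 - q_ij) for q_ij in q]   # theta_ij >= 0
    return math.exp(-sum(t * s for t, s in zip(theta, parent_states)))


# Hypothetical activation probabilities q_ij and parent states S_j.
q = [0.2, 0.5, 0.9]
parents = [1, 0, 1]

p_off = noisy_or_off_prob(parents, q)          # 0.8 * 0.1
p_off_exp = noisy_or_off_prob_exp(parents, q)
p_on = 1.0 - p_off
```

The exponential form is what makes the variational treatment below parallel to the sigmoid case: the parents enter only through a linear sum in the exponent.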

3.1 Upper bound for noisy-OR network 

The motivation and, in broad outline, the upper bound derivation itself
can be carried over from the sigmoid setting to the noisy-OR case.

Consider a two-level or bipartite network with $\{S_i\}_{i \in L_1}$
and $\{S_j\}_{j \in L_2}$ (where $L_2 \to L_1$) denoting the two sets
of variables. As before we adopt a simplifying notation in which an
instantiation consists of values for all the variables in the layer
$L_1$. To compute the likelihood of such an instantiation we need to
sum the noisy-OR joint distribution,

$$P(S_1, \ldots, S_n \mid \theta) = \prod_{i \in L_1} \Big( 1 - e^{-\sum_{j \in L_2} \theta_{ij} S_j} \Big)^{S_i} e^{-(1 - S_i) \sum_{j \in L_2} \theta_{ij} S_j} \times \prod_{j \in L_2} P(S_j \mid \theta_j) \tag{24}$$

over the uninstantiated variables in $L_2$. We note that the complexity
of performing this calculation exactly increases exponentially with the
number of variables that are instantiated to one; importantly, and
unlike in the sigmoid case, the complexity does not vary exponentially
with the number of uninstantiated variables. Nevertheless, we focus on
the case where the exact method of obtaining the likelihood is
infeasible.

To find an upper bound in the noisy-OR setting we use the following
variational transformation (for a derivation and discussion see
appendix B)

$$1 - e^{-x} = \min_{\xi \ge 0} e^{\xi x - F(\xi)} \tag{25}$$

where $F(\xi) = -\xi \log \xi + (\xi + 1) \log(\xi + 1)$. By inserting
this transformation into the joint distribution we obtain:


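$$P(S_1, \ldots, S_n \mid \theta) = \prod_{i \in L_1} \Big[ \min_{\xi_i} e^{\xi_i \sum_{j \in L_2} \theta_{ij} S_j - F(\xi_i)} \Big]^{S_i} e^{-(1 - S_i) \sum_{j \in L_2} \theta_{ij} S_j} \times \prod_{j \in L_2} P(S_j \mid \theta_j) \tag{26}$$

$$= \min_{\xi} \Big\{ e^{-\sum_{i \in L_1} S_i F(\xi_i)} \prod_{j \in L_2} \Big[ e^{\sum_{i \in L_1} (S_i \xi_i + S_i - 1) \theta_{ij} S_j} P(S_j \mid \theta_j) \Big] \Big\} \tag{27}$$

$$\stackrel{\mathrm{def}}{=} \min_{\xi} \{ P(S_1, \ldots, S_n \mid \theta, \xi) \} \tag{28}$$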


where we have regrouped terms by rewriting the product over
$i \in L_1$ as a sum in the exponent and collecting the terms depending
on the uninstantiated variables $S_j$. We can see that the implicitly
defined (and unnormalized) $P(S_1, \ldots, S_n \mid \theta, \xi)$
factorizes over $S_j$. This factorial property allows us to find a
closed form upper bound on the likelihood:


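$$P(\{S_i\}_{i \in L_1} \mid \theta) = \sum_{\{S_j\}_{j \in L_2}} P(S_1, \ldots, S_n \mid \theta) = \sum_{\{S_j\}_{j \in L_2}} \min_{\xi} P(S_1, \ldots, S_n \mid \theta, \xi) \tag{29}$$

$$\le \min_{\xi} \sum_{\{S_j\}_{j \in L_2}} P(S_1, \ldots, S_n \mid \theta, \xi) \tag{30}$$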

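where the last summation can now be performed exactly to yield:

$$P(\{S_i\}_{i \in L_1} \mid \theta) \le \min_{\xi} \Big\{ e^{-\sum_{i \in L_1} S_i F(\xi_i)} \times \prod_{j \in L_2} \Big[ P(S_j = 1 \mid \theta_j)\, e^{\sum_{i \in L_1} (S_i \xi_i + S_i - 1) \theta_{ij}} + P(S_j = 0 \mid \theta_j) \Big] \Big\} \tag{31}$$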


This bound (i) always stays below (or equal to) one as it is less than
or equal to one whenever all $\xi$ are set to zero, and (ii) is exact
when all $S_i$ in $L_1$ are zero or in the limit of vanishing
parameters $\theta_{ij}$.
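
The two properties just stated can be checked on a small example. The sketch below evaluates the closed-form noisy-OR upper bound for hand-picked (not optimized) variational parameters and compares it with the exact likelihood from enumeration; all numerical values and function names are hypothetical.

```python
import itertools
import math


def F(xi):
    """F(xi) = -xi log xi + (xi + 1) log(xi + 1), with F(0) = 0."""
    first = -xi * math.log(xi) if xi > 0.0 else 0.0
    return first + (xi + 1.0) * math.log(xi + 1.0)


def exact_likelihood(s1, theta, prior):
    """Sum the two-level noisy-OR joint over all configurations of L2."""
    total = 0.0
    for s2 in itertools.product([0, 1], repeat=len(prior)):
        p = 1.0
        for j, sj in enumerate(s2):
            p *= prior[j] if sj else 1.0 - prior[j]
        for i, si in enumerate(s1):
            x = sum(theta[i][j] * s2[j] for j in range(len(s2)))
            p *= (1.0 - math.exp(-x)) if si else math.exp(-x)
        total += p
    return total


def upper_bound(s1, theta, prior, xi):
    """Closed-form noisy-OR upper bound for fixed xi_i >= 0."""
    value = math.exp(-sum(si * F(x) for si, x in zip(s1, xi)))
    for j in range(len(prior)):
        expo = sum((s1[i] * xi[i] + s1[i] - 1) * theta[i][j]
                   for i in range(len(s1)))
        value *= prior[j] * math.exp(expo) + (1.0 - prior[j])
    return value


# Hypothetical non-negative weights theta[i][j] = -log(1 - q_ij).
theta = [[0.5, 1.2, 0.3], [0.9, 0.1, 0.7]]
prior = [0.3, 0.6, 0.5]

exact_mixed = exact_likelihood([1, 0], theta, prior)
bound_mixed = upper_bound([1, 0], theta, prior, [0.7, 0.0])
bound_zero_xi = upper_bound([1, 0], theta, prior, [0.0, 0.0])

exact_all_off = exact_likelihood([0, 0], theta, prior)
bound_all_off = upper_bound([0, 0], theta, prior, [0.7, 0.2])
```

When every instantiated variable is zero the variational parameters drop out of the expression, which is why the bound coincides with the exact likelihood in that case.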
As in the sigmoid case we may simplify the minimization process by
considering $\log P(\{S_i\}_{i \in L_1} \mid \theta)$ and introducing a
Legendre transformation for $\log(\cdot)$. This yields:


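$$\log P(\{S_i\}_{i \in L_1} \mid \theta) \le -\sum_{i \in L_1} S_i F(\xi_i) + \sum_{j \in L_2} \lambda_j \Big[ P(S_j = 1 \mid \theta_j)\, e^{\sum_{i \in L_1} (S_i \xi_i + S_i - 1) \theta_{ij}} + P(S_j = 0 \mid \theta_j) \Big] + \sum_{j \in L_2} [ -\log \lambda_j - 1 ] \tag{32}$$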

where we have dropped the explicit reference to minimization. The gain
again is the convexity of the bound with respect to any of the $\xi$ or
$\lambda$ variables.

3.2 Generic lower bound for noisy-OR network 

The earlier work on lower bounds by Saul et al. was restricted to
sigmoid networks; we extend that work here by deriving a lower bound
for generic noisy-OR networks. We refer to section 2.2 for the
framework and commence from the noisy-OR counterpart of eq. (18). Thus,

$$\log P(\{S_i\}_{i \in L} \mid \theta) \ge \sum_i E_Q\{ \log P(S_i \mid \mathrm{pa}[i], \theta) \} + H_Q \tag{33}$$

$$= \sum_i E_Q\Big\{ S_i \log\Big( 1 - e^{-\sum_j \theta_{ij} S_j} \Big) \Big\} + \sum_i E_Q\Big\{ -(1 - S_i) \sum_j \theta_{ij} S_j \Big\} + H_Q \tag{34}$$

which is obtained by writing explicitly the form of the conditional
probabilities for noisy-OR networks. While the second expectation in
eq. (34) simply corresponds to replacing the binary variables $S_i$
with their means (since $Q$ is factorized), the first expectation lacks
a closed form expression. To compute this expectation efficiently we
make use of the following expansion:

$$1 - e^{-x} = \prod_{k=0}^{\infty} g(2^k x) \tag{35}$$

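where $g(\cdot)$ is the sigmoid function (see appendix C). This
expansion converges exponentially fast and thus only a few terms need
to be included in the product for good accuracy. By carrying out this
expansion in the bound above and explicitly using the form of the
sigmoid function we get

$$\log P(\{S_i\}_{i \in L} \mid \theta) \ge \sum_{ik} \mu_i E_Q\Big\{ -\log\Big[ 1 + e^{-2^k \sum_j \theta_{ij} S_j} \Big] \Big\} - \sum_i (1 - \mu_i) \sum_j \theta_{ij} \mu_j + H_Q \tag{36}$$

Now, as the parameters are non-negative,
$e^{-2^k \sum_j \theta_{ij} S_j} \in [0, 1]$ and we may use the smooth
convexity properties of $-\log(1 + x)$ (for $x \in [0, 1]$) to bring
the expectations in eq. (36) inside the log. This results in

$$\log P(\{S_i\}_{i \in L} \mid \theta) \ge -\sum_{ik} \mu_i \log\Big[ 1 + \prod_j \big( \mu_j e^{-2^k \theta_{ij}} + 1 - \mu_j \big) \Big] - \sum_i (1 - \mu_i) \sum_j \theta_{ij} \mu_j + H_Q \tag{37}$$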

A more sophisticated and accurate way of computing the 
expectations in eq. (36) is discussed in appendix D. 
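
A sketch of how the simple lower bound might be evaluated is given below for a hypothetical two-level network. Several choices here are assumptions made for the illustration and are not taken from the text: every evidence node has a positive leak term (so that $\log(1 - e^{-x})$ stays finite under a factorized $Q$), the hidden layer has explicit priors $P(S_j = 1)$ whose expected log enters the bound, the means of the instantiated variables are set to their observed values, and the expansion is truncated after `n_terms` factors.

```python
import itertools
import math


def exact_loglik(s1, theta, leak, prior):
    """Exact log-likelihood of the evidence s1 by enumeration over L2."""
    total = 0.0
    for s2 in itertools.product([0, 1], repeat=len(prior)):
        p = 1.0
        for j, sj in enumerate(s2):
            p *= prior[j] if sj else 1.0 - prior[j]
        for i, si in enumerate(s1):
            x = leak[i] + sum(theta[i][j] * s2[j] for j in range(len(s2)))
            p *= (1.0 - math.exp(-x)) if si else math.exp(-x)
        total += p
    return math.log(total)


def lower_bound(s1, theta, leak, prior, mu, n_terms=12):
    """Mean field lower bound with the sigmoid-product expansion."""
    bound = 0.0
    for i, si in enumerate(s1):
        if si:
            # Evidence S_i = 1: -sum_k log(1 + E_Q{exp(-2^k x_i)}).
            for k in range(n_terms):
                scale = 2.0 ** k
                mean_exp = math.exp(-scale * leak[i])
                for j, m in enumerate(mu):
                    mean_exp *= m * math.exp(-scale * theta[i][j]) + 1.0 - m
                bound -= math.log(1.0 + mean_exp)
        else:
            # Evidence S_i = 0: -E_Q{x_i}, linear in the means.
            bound -= leak[i] + sum(theta[i][j] * mu[j]
                                   for j in range(len(mu)))
    for j, m in enumerate(mu):
        # Expected log prior of the hidden variables plus entropy of Q.
        bound += m * math.log(prior[j]) + (1.0 - m) * math.log(1.0 - prior[j])
        bound -= m * math.log(m) + (1.0 - m) * math.log(1.0 - m)
    return bound


# Hypothetical parameters (all weights and leaks non-negative).
theta = [[0.5, 1.2, 0.3], [0.9, 0.1, 0.7]]
leak = [0.5, 0.5]
prior = [0.3, 0.6, 0.5]
s1 = [1, 0]

exact = exact_loglik(s1, theta, leak, prior)
lower = lower_bound(s1, theta, leak, prior, mu=[0.4, 0.6, 0.5])
```

Truncating the product drops a factor $1 - e^{-2^N x}$ that is at most one, so the truncated expression is a rigorous bound only up to that remainder; with a positive leak and a dozen terms the remainder is far below floating point precision.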

3.3 Numerical experiments for noisy-OR 
network 

The method of testing used here was, for the most part, identical to
the one presented earlier for sigmoid networks (section 2.3). The only
difference was that the prior distribution over the parameters defining
the conditional probabilities was chosen to be a Dirichlet instead of a
Gaussian:

$$q_{ij} \sim n (1 - q_{ij})^{n-1} \tag{38}$$

(recall that $P(S_i = 0 \mid \mathrm{pa}[i], \theta) = \prod_{j \in \mathrm{pa}[i]} (1 - q_{ij})^{S_j}$).
For large $n$, $q$ stays small (or $1 - q \approx 1$) and the layers of
the bipartite network are only weakly connected; smaller values of $n$,
on the other hand, make the layers strongly dependent. We thus used $n$
to vary (on average) the interdependence between the two layers. To
facilitate comparisons with the bounds derived for sigmoid networks we
used $\sigma_{\mathrm{std}}$ (see eq. (20)) as a measure of dependence
between the layers.

Figure 2 illustrates the accuracy of the computed bounds as a function
of $\sigma_{\mathrm{std}}$$^6$. The samples with zero relative error
are from the upper bound in cases where all the instantiated variables
are zero, since the bound becomes exact whenever this happens. The
lower bound is slightly worse than the one for sigmoid networks, most
likely due to the symmetry and smoother nature of the sigmoid function.
As with the sigmoid networks, the upper bound becomes less accurate
more quickly.

4 Discussion and future work 

Applying probabilistic methods to real world inference problems can
lead to the emergence of cliques that are prohibitively large for exact
algorithms (for example, in medical diagnosis). We focused on dealing
with such large (sub)structures in the context of sigmoid belief
networks and noisy-OR networks. For these networks we developed
techniques for computing upper and lower bounds on the likelihoods of
partial instantiations of variables. The bounds serve as an alternative
to sampling methods in the presence of intractable structures. They can
define confidence intervals for the likelihoods and can be used to
improve the accuracy of decision making in intractable networks.




6 The slight unevenness of the samples is due to the nonlinear
relationship between the Dirichlet parameter $n$ and
$\sigma_{\mathrm{std}}$.

[Plot: horizontal axis "Stdv" from 0.05 to 0.25; vertical axis from -0.1 to 0.1.]

Figure 2: Accuracy of the bounds for noisy-OR networks. The solid lines
are the median relative errors in log-likelihood as a function of
$\sigma_{\mathrm{std}}$. The upper and lower curves correspond to the
upper and lower bounds respectively.



Toward extending the work presented in this paper we note that both the
upper and lower bounds can be improved by considering a mixture
partitioning (Jaakkola & Jordan, 1996) of the space of uninstantiated
variables instead of using a completely factorized approximation.
Furthermore, the restriction of the upper bounds to two-level networks
can be overcome, for example, by interlacing them with sampling
techniques, although other extensions may be possible as well.
Following Saul & Jordan (1996) we may also merge the obtained bounds
with exact methods whenever they are feasible.
A Sigmoid transformation 

Here we derive and discuss the following transformation:

$$g(x) = \frac{1}{1 + e^{-x}} = \min_{\xi \in [0,1]} e^{\xi x - H(\xi)}$$

Although a proof by hindsight would be shorter than a direct
derivation, we present the derivation because it is more informative.
To this end, let us switch to log scale and consider

$$\begin{aligned}
-\log(1 + e^{-x}) &= -\log \sum_{m \in \{0,1\}} e^{-mx} \\
&= -\log \sum_{m \in \{0,1\}} \xi^m (1 - \xi)^{1-m} \, \frac{e^{-mx}}{\xi^m (1 - \xi)^{1-m}} \\
&= -\log E\Big\{ \frac{e^{-mx}}{\xi^m (1 - \xi)^{1-m}} \Big\} \\
&\le E\Big\{ -\log \frac{e^{-mx}}{\xi^m (1 - \xi)^{1-m}} \Big\} \\
&= \xi x + \xi \log \xi + (1 - \xi) \log(1 - \xi) \\
&= \xi x - H(\xi)
\end{aligned}$$


which follows from interpreting $\xi^m (1 - \xi)^{1-m}$ as a
probability mass for $m$ and from an application of Jensen’s
inequality. Actually performing the minimization over $\xi$ gives
$\xi^* = g(-x)$ and leads to an equality instead of a bound. The
geometry of the bound when $\xi$ is kept fixed for all $x$ is
illustrated in figure 3. The value of $x$ for which the chosen $\xi$ is
optimal is the point where the bound is exact.

We finally note that the above transformation can be 
understood as a type of Legendre transformation. 
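
The statements above can be verified numerically. In this minimal sketch the argument `x` and the trial values of `xi` are arbitrary illustrative choices.

```python
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_entropy(xi):
    """H(xi) in nats, with H(0) = H(1) = 0."""
    if xi <= 0.0 or xi >= 1.0:
        return 0.0
    return -xi * math.log(xi) - (1.0 - xi) * math.log(1.0 - xi)


def sigmoid_bound(x, xi):
    """exp(xi * x - H(xi)), an upper bound on g(x) for xi in [0, 1]."""
    return math.exp(xi * x - binary_entropy(xi))


x = 0.9                       # hypothetical argument
xi_star = g(-x)               # minimizer of the bound at this x
tight = sigmoid_bound(x, xi_star)
slack = [sigmoid_bound(x, xi) - g(x) for xi in (0.0, 0.1, 0.5, 0.9, 1.0)]
```

The slack is non-negative for every fixed $\xi$ and vanishes only at $\xi^* = g(-x)$.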






Figure 3: Geometry of the sigmoid transformation. The dashed curve
plots $\exp\{\xi x - H(\xi)\}$ as a function of $x$ for a fixed $\xi$
(= 0.5).


Figure 4: Geometry of the noisy-OR transformation. The dashed curve
gives $\exp\{\xi x - F(\xi)\}$ as a function of $x$ when $\xi$ is fixed
at 0.5.


B Noisy-OR transformation 

Here we provide a derivation for the transformation

$$1 - e^{-x} = \min_{\xi \ge 0} e^{\xi x - F(\xi)} \tag{39}$$

presented in the text. Switching to log scale we find


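$$\begin{aligned}
\log(1 - e^{-x}) &= -\log \frac{1}{1 - e^{-x}} = -\log \sum_{k=0}^{\infty} e^{-kx} \\
&= -\log \sum_{k=0}^{\infty} (1 - q) q^k \, \frac{e^{-kx}}{(1 - q) q^k} \\
&= -\log E\Big\{ \frac{e^{-kx}}{(1 - q) q^k} \Big\} \\
&\le E\Big\{ -\log \frac{e^{-kx}}{(1 - q) q^k} \Big\} \\
&= \sum_{k=0}^{\infty} (1 - q) q^k \, k x + \sum_{k=0}^{\infty} (1 - q) q^k [ \log(1 - q) + k \log q ] \\
&= \frac{q}{1 - q}\, x + \log(1 - q) + \frac{q}{1 - q} \log q
\end{aligned}$$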


where we have interpreted $(1 - q) q^k$ as a probability distribution
for $k$ and used Jensen’s inequality. Minimizing the above bound with
respect to $q$ gives $q^* = e^{-x}$ and the bound becomes exact. The
original transformation follows by setting $\xi = q/(1 - q)$. If the
value of $\xi$ is kept constant, the transformation yields a bound, the
geometry of which is shown in figure 4. The point where the bound
touches the $1 - e^{-x}$ curve defines $x$ for which the constant $\xi$
is optimal.

As in the sigmoid case the resulting transformation 
can be seen as a type of Legendre transformation. 
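
A quick numerical check of the transformation, under the substitution $\xi = q/(1 - q)$ with $q^* = e^{-x}$, is sketched below; the argument `x` and the trial values of `xi` are arbitrary illustrative choices.

```python
import math


def F(xi):
    """F(xi) = -xi log xi + (xi + 1) log(xi + 1), with F(0) = 0."""
    first = -xi * math.log(xi) if xi > 0.0 else 0.0
    return first + (xi + 1.0) * math.log(xi + 1.0)


def noisy_or_bound(x, xi):
    """exp(xi * x - F(xi)), an upper bound on 1 - exp(-x) for xi >= 0."""
    return math.exp(xi * x - F(xi))


x = 1.3                                   # hypothetical positive argument
q_star = math.exp(-x)
xi_star = q_star / (1.0 - q_star)         # optimal variational parameter
target = 1.0 - math.exp(-x)

tight = noisy_or_bound(x, xi_star)
slack = [noisy_or_bound(x, xi) - target for xi in (0.0, 0.1, 1.0, 3.0)]
```

At $\xi = 0$ the bound equals one, which is the trivial bound on a probability.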


C Noisy-OR expansion 

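The noisy-OR expansion

$$1 - e^{-x} = \prod_{k=0}^{\infty} g(2^k x)$$

follows simply from

$$1 - e^{-x} = \frac{(1 + e^{-x})(1 - e^{-x})}{1 + e^{-x}} = g(x)(1 - e^{-2x}) \tag{40}$$

$$= g(x)\, \frac{(1 + e^{-2x})(1 - e^{-2x})}{1 + e^{-2x}} = g(x)\, g(2x)\, (1 - e^{-4x}) \tag{41}$$
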
and induction. For $x > 0$ the accuracy of the expansion is governed by
$1 - e^{-2^N x}$, which goes to one exponentially fast. Also, since
$g(2^k \cdot 0) = 1/2$, the expansion becomes $(1/2)^N$ at $x = 0$,
where $N$ is the number of terms included. As this approaches
$1 - e^{-0} = 0$ exponentially fast, we conclude that the rapid
convergence is uniform. Figure 5 illustrates the accuracy of the
expansion for small $N$.
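
The convergence behaviour can be reproduced with a short script. This is a minimal sketch in which the test point `x` and the numbers of terms are arbitrary illustrative choices.

```python
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def truncated_expansion(x, n_terms):
    """Product of the first n_terms factors g(2^k x), k = 0, ..., n_terms - 1."""
    value = 1.0
    for k in range(n_terms):
        value *= g((2.0 ** k) * x)
    return value


x = 1.0
target = 1.0 - math.exp(-x)

# The N-term product times the remainder 1 - exp(-2^N x) is exact.
with_remainder = truncated_expansion(x, 3) * (1.0 - math.exp(-8.0 * x))

# Truncation errors for N = 1, ..., 6 terms.
errors = [truncated_expansion(x, n) - target for n in range(1, 7)]

# At x = 0 the N-term product is (1/2)^N.
at_zero = truncated_expansion(0.0, 5)
```

The truncated product always overestimates $1 - e^{-x}$, because the omitted remainder factor is at most one.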



Figure 5: Accuracy of the noisy-OR expansion. Dotted 
line: N = 1, dashed line: N = 2, dotdashed: N = 3. N 
is the number of terms included in the expansion. 


D Quadratic bound 

For $X \in [0, 1]$ we can bound $-\log(1 + X)$ by a quadratic
expression:

$$-\log(1 + X) \ge a (X - x)^2 + b (X - x) + c \tag{42}$$

where $c = -\log(1 + x)$, $b = -1/(1 + x)$, and
$a = -[(1 - x) b + c + \log 2]/(1 - x)^2$. The coefficients can be
derived by requiring that the quadratic expression and its derivative
are exact at $X = x$, and by choosing the largest possible $a$ such
that the expression remains a bound. The resulting approximation is
good for all $x \in [0, 1]$ and can be optimized by setting
$x = E\{X\}$.
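
The quadratic bound can be checked on a grid. The sketch below uses a hypothetical expansion point `x0 = 0.4` (the formula for $a$ requires an expansion point strictly below one) and verifies the inequality for $X$ between 0 and 1.

```python
import math


def quadratic_coefficients(x0):
    """Coefficients (a, b, c) of the quadratic bound expanded at x0 < 1."""
    c = -math.log(1.0 + x0)
    b = -1.0 / (1.0 + x0)
    a = -((1.0 - x0) * b + c + math.log(2.0)) / (1.0 - x0) ** 2
    return a, b, c


def quadratic_bound(X, x0):
    a, b, c = quadratic_coefficients(x0)
    return a * (X - x0) ** 2 + b * (X - x0) + c


x0 = 0.4                                        # hypothetical expansion point
grid = [k / 100.0 for k in range(101)]          # X in [0, 1]
gaps = [-math.log(1.0 + X) - quadratic_bound(X, x0) for X in grid]

a, b, c = quadratic_coefficients(x0)
```

The gap vanishes at $X = x_0$ (where value and slope match) and at $X = 1$ (which fixes $a$), and is positive elsewhere on the interval.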

Let us now use this quadratic bound in eq. (36) to 
better approximate the expectations. To simplify the 






ensuing formulas we use the notation 



3 

With these we straightforwardly find 

logP({S,} ieI |») 

ik 

+ ^2 Hi [bik(xl k) - xfb) + c ijfe ] 

ik 

-J2( 1 -^)J2 9 ^^+Hq (44) 

i 3 

which is optimized with respect to $x_i^{(k)}$ simply by setting
$x_i^{(k)}$ equal to the corresponding expectation. The simpler bound
in eq. (37) corresponds to ignoring the quadratic correction, i.e.,
using $a_{ik} = 0$ above.
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